1. Introduction

Let $(X_i, Y_i)_{i \le n}$ be a sample of the variable set $(X, Y)$ where $Y$ is an
indicator variable and $X$ is an explanatory variable. Conditionally on $X$, $Y$
follows a Bernoulli distribution with parameter $p(x) = \Pr(Y = 1 \mid X = x)$.
Usual examples are response variables $Y$ to a dose $X$ or to an exposure time $X$,
and economic indicators. The variable $X$ may be observed at fixed values $x_i$,
$i \in \{1, \ldots, m\}$, on a regular grid $\{1/m, \ldots, 1\}$ or at irregular
fixed or random times $t_j$, $j \le n$, for a continuous process $(X_t)_{t \le T}$.

Exponential linear models with known link functions are often used, especially
the logistic regression model defined by
$p(x) = e^{\psi(x)}\{1 + e^{\psi(x)}\}^{-1}$ with a parametric function $\psi$.
The inverse function of $p$ is easily estimated using maximum likelihood
estimators of the parameters and many authors have studied confidence sets for
the parameters and the quantiles of the model.

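In a nonparametric setting and for discrete sampling design with several
independent observations for each value $x_j$ of $X$, the likelihood is written

\[
L_n = \prod_{i=1}^n p(X_i)^{Y_i}\{1 - p(X_i)\}^{1-Y_i}
    = \prod_{j=1}^m \prod_{i=1}^n \bigl[\{p(x_j)\}^{Y_i}\{1 - p(x_j)\}^{1-Y_i}\bigr]^{1_{\{X_i = x_j\}}}.
\]

The maximum likelihood estimator of $p(x_j)$ is the proportion of individuals with
$Y_i = 1$ as $X_i = x_j$,

\[
\hat p_{1n}(x_j) = \bar Y_{n;x_j}
  = \frac{\sum_{i=1}^n Y_i 1_{\{X_i = x_j\}}}{\sum_{i=1}^n 1_{\{X_i = x_j\}}},
  \quad j = 1, \ldots, m.
\]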

Regular versions of this estimator are obtained by kernel smoothing or by
projections on a regular basis of functions, especially if the variable $X$ is
continuous. Let $K$ denote a symmetric positive kernel with integral 1,
$h = h_n$ a bandwidth and $K_h(x) = h^{-1}K(h^{-1}x)$, with $h_n \to 0$ as
$n \to \infty$. A local maximum likelihood estimator of $p$ is defined as

\[
\hat p_{2n}(x) = \frac{\sum_{i=1}^n Y_i K_h(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)}
\]

or by higher order polynomial approximations [5]. 
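
As an illustration, the kernel estimator of $p$ can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: the Epanechnikov kernel, the function names and the toy data are assumptions made for the example.

```python
def epanechnikov(u):
    """Symmetric positive kernel with integral 1, supported on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_proportion(x, xs, ys, h, kernel=epanechnikov):
    """Kernel-weighted proportion of responses Y_i = 1 around x.

    Returns None when no observation falls within the bandwidth h of x.
    """
    weights = [kernel((x - xi) / h) / h for xi in xs]  # K_h(x - X_i)
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical sample: the response becomes more frequent as x grows.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1]
p_hat = kernel_proportion(0.5, xs, ys, h=0.25)
```

With a compactly supported kernel the estimate at $x$ only uses the observations within distance $h$ of $x$, which is why the function returns `None` outside the range of the data.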

Under regularity conditions on $p$ and $K$ and ergodicity of the process
$(X_t, Y_t)_{t \ge 0}$, the estimator $\hat p_{2n}$ is P-uniformly consistent and
asymptotically Gaussian. When $p$ is monotone, the estimators are asymptotically
monotone in probability. For large $n$, the inverse function $q$ is then estimated
by $\hat q_n(u) = \sup\{x : \hat p_n(x) \le u\}$ if $p$ is decreasing or by
$\hat q_n(u) = \inf\{x : \hat p_n(x) \ge u\}$ if $p$ is increasing. The estimator
$\hat q_n$ is also P-uniformly consistent and asymptotically Gaussian [7]. For
small samples, a monotone version of $\hat p_n$ using the greatest convex minorant
or the smallest concave majorant algorithm may be used before defining a direct
inverse. Other nonparametric inverse functions have been defined [1].
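
The monotone correction and the direct inverse can be sketched as follows for an increasing $p$. The sketch uses the pool-adjacent-violators algorithm, which returns the slopes of the greatest convex minorant of the cumulative sum diagram; the function names, the grid and the numerical values are assumptions chosen for illustration.

```python
def isotonic_increasing(values, weights=None):
    """Pool-adjacent-violators: weighted least-squares nondecreasing fit."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total weight, number of points]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            merged = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / merged, merged, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def inverse_increasing(grid, p_values, u):
    """inf{x in grid : p(x) >= u}; None if the level u is never reached."""
    for x, p in zip(grid, p_values):
        if p >= u:
            return x
    return None


grid = [0.1, 0.2, 0.3, 0.4, 0.5]
raw = [0.1, 0.3, 0.2, 0.6, 0.5]          # hypothetical non-monotone estimates
mono = isotonic_increasing(raw)          # adjacent violators are averaged
q_half = inverse_increasing(grid, mono, 0.5)
```

For a decreasing $p$ the same code applies to the reversed sequence, with the supremum taken in place of the infimum.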

Under biased sampling, censoring or truncation, the distribution function of $Y$
conditionally on $X$ is not always identifiable. The paper studies several cases
and defines new estimators of conditional and marginal distributions, for a
continuous bivariate set $(X, Y)$ and for a conditional Bernoulli variable $Y$.



2. Bias depending on the value of Y 

In case-control studies, individuals are not uniformly sampled in the population:
for rare events, they are sampled so that the cases of interest (individuals with
$Y_i = 1$) are sufficiently represented in the sample but the proportion of cases
in the sample differs from its proportion in the general population [6]. Let $S_i$
be the sampling indicator of individual $i$ in the global population and

\[
\Pr(S_i = 1 \mid Y_i = 1) = \lambda_1, \quad \Pr(S_i = 1 \mid Y_i = 0) = \lambda_0.
\]

The distribution function of $(S_i, Y_i)$ conditionally on $X_i = x$ is given by

\[
\Pr(S_i = 1, Y_i = 1 \mid x) = \Pr(S_i = 1 \mid Y_i = 1)\Pr(Y_i = 1 \mid x) = \lambda_1 p(x),
\]
\[
\Pr(S_i = 1, Y_i = 0 \mid x) = \Pr(S_i = 1 \mid Y_i = 0)\Pr(Y_i = 0 \mid x) = \lambda_0\{1 - p(x)\},
\]
\[
\Pr(S_i = 1 \mid x) = \Pr(S_i = 1, Y_i = 1 \mid x) + \Pr(S_i = 1, Y_i = 0 \mid x)
  = \lambda_1 p(x) + \lambda_0\{1 - p(x)\}.
\]

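Let

\[
\theta = \frac{\lambda_0}{\lambda_1}, \quad \alpha(x) = \theta\,\frac{1 - p(x)}{p(x)}.
\]

For individual $i$, $(X_i, Y_i)$ is observed conditionally on $S_i = 1$ and the
conditional distribution function of $Y_i$ is defined by

\[
\pi(x) = \Pr(Y_i = 1 \mid S_i = 1, X = x)
  = \frac{\lambda_1 p(x)}{\lambda_1 p(x) + \lambda_0\{1 - p(x)\}}
  = \frac{p(x)}{p(x) + \theta\{1 - p(x)\}}
  = \frac{1}{1 + \alpha(x)}.
\]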

The probability $p(x)$ is deduced from $\theta$ and $\pi(x)$ by the relation

\[
p(x) = \frac{\theta\pi(x)}{1 + (\theta - 1)\pi(x)}
\]

and the sampling bias is

\[
\pi(x) - p(x) = \frac{(1 - \theta)\pi(x)(1 - \pi(x))}{1 + (\theta - 1)\pi(x)}.
\]
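
The relation between $p$, $\pi$ and $\theta$ and the expression of the bias can be checked numerically. The following Python sketch is only an illustration; the function names and the values of $\theta$ and $p$ are hypothetical.

```python
def sampled_probability(p, theta):
    """pi = p / (p + theta * (1 - p)): probability of a case in the sample."""
    return p / (p + theta * (1.0 - p))


def population_probability(pi, theta):
    """p = theta * pi / (1 + (theta - 1) * pi): inversion of the map above."""
    return theta * pi / (1.0 + (theta - 1.0) * pi)


def sampling_bias(pi, theta):
    """pi - p expressed through pi and theta only."""
    return (1.0 - theta) * pi * (1.0 - pi) / (1.0 + (theta - 1.0) * pi)


theta = 0.05   # hypothetical: non-cases are sampled 20 times less often than cases
p = 0.02       # hypothetical population probability of a case at x
pi = sampled_probability(p, theta)
recovered = population_probability(pi, theta)   # equals p up to rounding
bias = sampling_bias(pi, theta)                 # equals pi - p up to rounding
```

When $\theta = 1$ both groups are sampled at the same rate, the two probabilities coincide and the bias vanishes.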



The model defined by $(\lambda_0, \lambda_1, p(x))$ is over-parameterized and only
the function $\alpha$ is identifiable. The proportion $\theta$ must therefore be
known or estimated from a preliminary study before an estimation of the
probability function $p$. In the logistic regression model,
$\psi(x) = \log[p(x)\{1 - p(x)\}^{-1}]$ is replaced by
$-\log\alpha(x) = \log[\pi(x)\{1 - \pi(x)\}^{-1}] = \psi(x) - \log\theta$. The
biased sampling therefore modifies the parameters of the model but not the model
itself, and the only stable parametric model is the logistic regression.

Let $\gamma$ be the ratio of non-cases to cases in the population,

\[
\gamma = \Pr(Y = 0)/\Pr(Y = 1) = E(1 - Y)/EY
  = \frac{1 - \int p(x)\,dF_X(x)}{\int p(x)\,dF_X(x)}. \qquad (1)
\]

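Under the biased sampling,

\[
\Pr(Y_i = 1 \mid S_i = 1)
  = \frac{\lambda_1 \int p\,dF_X}{\lambda_0(1 - \int p\,dF_X) + \lambda_1 \int p\,dF_X}
  = \frac{1}{1 + \theta\gamma},
\]
\[
\Pr(Y_i = 0 \mid S_i = 1)
  = \frac{\lambda_0(1 - \int p\,dF_X)}{\lambda_0(1 - \int p\,dF_X) + \lambda_1 \int p\,dF_X}
  = \frac{\theta\gamma}{1 + \theta\gamma},
\]

so $\gamma$ is modified by the scale parameter $\theta$: it becomes
$\Pr(Y = 0 \mid S = 1)/\Pr(Y = 1 \mid S = 1) = \theta\gamma$.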

The product $\theta\gamma$ may be directly estimated by maximization of the
likelihood and

\[
\widehat{\theta\gamma}_n = \frac{\sum_i (1 - Y_i) 1_{\{S_i = 1\}}}{\sum_i Y_i 1_{\{S_i = 1\}}}.
\]

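In a discrete sampling design with several independent observations for fixed
values $x_j$ of the variable $X$, the likelihood is

\[
\prod_{i=1}^n \pi(X_i)^{Y_i}\{1 - \pi(X_i)\}^{1-Y_i}
  = \prod_{j=1}^m \prod_{i=1}^n \bigl[\pi(x_j)^{Y_i}\{1 - \pi(x_j)\}^{1-Y_i}\bigr]^{1_{\{X_i = x_j\}}}
\]

and $\alpha_j = \alpha(x_j)$ is estimated by

\[
\hat\alpha_{jn} = \frac{\sum_i (1 - Y_i) 1_{\{S_i = 1\}} 1_{\{X_i = x_j\}}}{\sum_i Y_i 1_{\{X_i = x_j\}} 1_{\{S_i = 1\}}}.
\]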



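For random observations of the variable $X$, or for fixed observations without
replications, $\alpha(x)$ is estimated by

\[
\hat\alpha_n(x) = \frac{\sum_i (1 - Y_i) 1_{\{S_i = 1\}} K_h(x - X_i)}{\sum_i Y_i 1_{\{S_i = 1\}} K_h(x - X_i)}.
\]

If $\theta$ is known, nonparametric estimators of $p$ are deduced as

\[
\hat p_n(x_j) = \frac{\theta \sum_i Y_i 1_{\{S_i = 1\}} 1_{\{X_i = x_j\}}}{\sum_i (1 - Y_i + \theta Y_i) 1_{\{S_i = 1\}} 1_{\{X_i = x_j\}}}
  \quad \text{in the discrete case,}
\]
\[
\hat p_n(x) = \frac{\theta \sum_i Y_i 1_{\{S_i = 1\}} K_h(x - X_i)}{\sum_i (1 - Y_i + \theta Y_i) 1_{\{S_i = 1\}} K_h(x - X_i)}
  \quad \text{in the continuous case.}
\]
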
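
Both corrected estimators have the same weighted form, with weights $1_{\{X_i = x_j\}}$ in the discrete case and $K_h(x - X_i)$ in the continuous case. A minimal Python sketch, assuming that only the sampled individuals ($S_i = 1$) are passed to the function and that $\theta$ is known; the function name and the data are hypothetical.

```python
def corrected_probability(ys, weights, theta):
    """Estimate p from sampled responses ys (0/1) under biased sampling.

    weights[i] is 1{X_i = x_j} in the discrete case or K_h(x - X_i) in the
    continuous case. Returns None when the weighted sample is empty.
    """
    num = theta * sum(w * y for w, y in zip(weights, ys))
    den = sum(w * (1 - y + theta * y) for w, y in zip(weights, ys))
    return num / den if den > 0 else None


# Hypothetical sample at a fixed value x_j: 30 cases and 70 non-cases.
ys = [1] * 30 + [0] * 70
weights = [1.0] * len(ys)
p_hat = corrected_probability(ys, weights, theta=0.1)   # 3 / 73
```

With $\theta = 1$ the function returns the raw proportion of cases, here 0.3, while $\theta = 0.1$ shrinks it to about 0.041.
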
3. Bias due to truncation on X 

Consider that $Y$ is observed under a fixed truncation of $X$: we assume that
$(X, Y)$ is observed only if $X \in [a, b]$, a sub-interval of the support $I_X$
of the variable $X$, and $S = 1_{[a,b]}(X)$. Then

\[
\Pr(Y_i = 1) = \int_{I_X} p(x)\,dF_X(x), \quad
\Pr(Y_i = 1, S_i = 1) = \int_a^b p(x)\,dF_X(x)
\]

and the conditional probabilities of sampling, given the status value, are

\[
\lambda_1 = \Pr(S_i = 1 \mid Y_i = 1) = \frac{\int_a^b p(x)\,dF_X(x)}{\int_{I_X} p(x)\,dF_X(x)},
\]
\[
\lambda_0 = \Pr(S_i = 1 \mid Y_i = 0) = \frac{\int_a^b \{1 - p(x)\}\,dF_X(x)}{\int_{I_X} \{1 - p(x)\}\,dF_X(x)}.
\]

If the ratio $\theta = \lambda_0/\lambda_1$ is known or otherwise estimated, the
previous estimators may be used for the estimation of $p(x)$ from the truncated
sample with $S_i = 1$.

For a random truncation interval $[A, B]$, the sampling indicator is
$S = 1_{[A,B]}(X)$ and the integrals of $p$ are replaced by their expectation with
respect to the distribution function of $A$ and $B$ and the estimation is similar.



4. Truncation of a response variable in a nonparametric regression
model

Consider then $(X, Y)$ a two-dimensional variable in a left-truncated
transformation model: let $Y$ denote a response to a continuous exposure variable
$X$, up to a variable of individual variations $\varepsilon$ independent of $X$,

\[
Y = m(X) + \varepsilon, \quad E\varepsilon = 0, \quad E\varepsilon^2 < \infty,
\]

$(X, \varepsilon)$ with distribution function $(F_X, F_\varepsilon)$. The
distribution function of $Y$ conditionally on $X$ is defined by

\[
F_{Y|X}(y; x) = P(Y \le y \mid X = x) = F_\varepsilon(y - m(x)), \qquad (2)
\]
\[
m(x) = E(Y \mid X = x),
\]

and the function $m$ is continuous. The joint and marginal distribution functions
of $X$ and $Y$ are denoted $F_{X,Y}$, with support $I_{Y,X}$, $F_X$, with bounded
support $I_X$, and $F_Y$, such that
$F_Y(y) = \int F_\varepsilon(y - m(s))\,dF_X(s)$ and
$F_{X,Y}(x, y) = \int 1_{\{s \le x\}} F_\varepsilon(y - m(s))\,dF_X(s)$.

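The observation of $Y$ is supposed left-truncated by a variable $T$ independent
of $(X, Y)$, with distribution function $F_T$: $Y$ and $T$ are observed
conditionally on $Y \ge T$ and none of the variables is observed if $Y < T$.
Denote $\bar F = 1 - F$ for any distribution function $F$ and, under
left-truncation,

\[
\alpha(x) = P(T \le Y \mid X = x) = \int_{-\infty}^{\infty} \bar F_\varepsilon(y - m(x))\,dF_T(y),
\]
\[
A(y; x) = P(Y \le y \mid X = x, T \le Y)
  = \alpha^{-1}(x) \int_{-\infty}^{y} F_T(v)\,dF_\varepsilon(v - m(x)), \qquad (3)
\]
\[
B(y; x) = P(T \le y \le Y \mid X = x, T \le Y)
  = \alpha^{-1}(x) F_T(y) \bar F_\varepsilon(y - m(x)), \qquad (4)
\]
\[
m^*(x) = E(Y \mid X = x, T \le Y) = \alpha^{-1}(x) \int y F_T(y)\,dF_{Y|X}(y; x).
\]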



The mean of $Y$ is therefore biased under the truncation and a direct estimation
of the conditional distribution function $F_{Y|X}$ is of interest for the
estimation of $m(x) = E(Y \mid X = x)$ instead of the apparent mean $m^*(x)$. The
function $\bar F_\varepsilon$ is also written $\exp\{-\Lambda_\varepsilon\}$ with
$\Lambda_\varepsilon(y) = \int_{-\infty}^{y} \bar F_\varepsilon^{-1}\,dF_\varepsilon$
and the expressions (3)-(4) of $A$ and $B$ imply that

\[
\Lambda_\varepsilon(y - m(x)) = \int_{-\infty}^{y} B^{-1}(s; x)\,A(ds; x)
\]

and $\bar F_{Y|X}(y; x) = \exp\{-\Lambda_\varepsilon(y - m(x))\}$.

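An estimator of $\bar F_{Y|X}(y; x)$ is obtained as the product-limit estimator
$\hat{\bar F}_{\varepsilon,n}(y - m(x))$ of $\bar F_\varepsilon(y - m(x))$ based on
estimators of $A$ and $B$. For a sample $(X_i, Y_i, T_i)_{1 \le i \le n}$, let $x$
in $I_{X,n,h} = [\min_i X_i + h, \max_i X_i - h]$ and

\[
\hat A_n(y; x) = \frac{\sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le Y_i \le y\}}}{\sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le Y_i\}}},
\]
\[
\hat B_n(y; x) = \frac{\sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le y \le Y_i\}}}{\sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le Y_i\}}},
\]
\[
\hat{\bar F}_{Y|X,n}(y; x)
  = \prod_{Y_i \le y} \Bigl\{1 - \frac{d\hat A_n(Y_i; x)}{\hat B_n(Y_i; x)}\Bigr\}
  = \prod_{1 \le i \le n} \Bigl\{1 - \frac{K_h(x - X_i) I_{\{T_i \le Y_i \le y\}}}{\sum_{j=1}^n K_h(x - X_j) I_{\{T_j \le Y_i \le Y_j\}}}\Bigr\},
\]

with $0/0 = 0$. That is a nonparametric maximum likelihood estimator of
$\bar F_{Y|X}$, as is the Kaplan-Meier estimator for the distribution function of
a right-censored variable. Then an estimator of $m(x)$ may be defined as an
estimator of $\int y\,F_{Y|X}(dy; x)$,

\[
\hat m_n(x) = \sum_{i=1}^n Y_i I_{\{T_i \le Y_i\}} \{\hat F_{Y|X,n}(Y_i; x) - \hat F_{Y|X,n}(Y_i^-; x)\}
  = \sum_{i=1}^n \frac{Y_i I_{\{T_i \le Y_i\}} K_h(x - X_i)\,\hat{\bar F}_{Y|X,n}(Y_i^-; x)}{\sum_{j=1}^n K_h(x - X_j) I_{\{T_j \le Y_i \le Y_j\}}}. \qquad (6)
\]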
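
The kernel-weighted product-limit estimator and the mean estimator (6) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the observations $Y_i$ are distinct, the Epanechnikov kernel is an arbitrary choice, and the function names are hypothetical.

```python
def epanechnikov(u):
    """Symmetric positive kernel with integral 1 (assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def conditional_survival(y, x, xs, ys, ts, h, kernel=epanechnikov):
    """Product-limit estimate of P(Y > y | X = x) from triples with T_i <= Y_i."""
    n = len(ys)
    w = [kernel((x - xi) / h) / h for xi in xs]  # K_h(x - X_i)
    surv = 1.0
    for i in range(n):
        if ts[i] <= ys[i] <= y and w[i] > 0.0:
            # Weighted number of individuals j with T_j <= Y_i <= Y_j.
            at_risk = sum(w[j] for j in range(n) if ts[j] <= ys[i] <= ys[j])
            surv *= 1.0 - w[i] / at_risk
    return surv


def conditional_mean(x, xs, ys, ts, h, kernel=epanechnikov):
    """Sum of Y_i times the jump of the estimated conditional distribution at Y_i."""
    n = len(ys)
    w = [kernel((x - xi) / h) / h for xi in xs]
    surv_left = 1.0  # estimated P(Y >= Y_i | X = x), updated in increasing order
    total = 0.0
    for i in sorted(range(n), key=lambda k: ys[k]):
        if w[i] == 0.0 or ts[i] > ys[i]:
            continue
        at_risk = sum(w[j] for j in range(n) if ts[j] <= ys[i] <= ys[j])
        jump = surv_left * w[i] / at_risk
        total += ys[i] * jump
        surv_left -= jump
    return total


# Hypothetical data, all observed at the same x so the kernel weights are equal.
xs = [0.5, 0.5, 0.5, 0.5]
ys = [1.0, 2.0, 3.0, 4.0]
ts = [0.0, 0.0, 2.5, 0.0]
s1 = conditional_survival(1.0, 0.5, xs, ys, ts, h=0.2)
```

In this toy sample the third individual enters the risk set only from 2.5 on, so the risk set at $Y_1 = 1$ has three members and the estimated survival just after 1 is 2/3. Without truncation the two functions reduce to the kernel-weighted empirical survival function and mean.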



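By the same arguments, from means in (3)-(4),
$\bar F_Y(y) = E\bar F_\varepsilon(y - m(X))$ is estimated by

\[
\hat{\bar F}_{Y,n}(y) = \prod_{1 \le i \le n} \Bigl\{1 - \frac{I_{\{T_i \le Y_i \le y\}}}{\sum_{j=1}^n I_{\{T_j \le Y_i \le Y_j\}}}\Bigr\},
\]

the distribution function $F_T$ is simply estimated by the product-limit estimator
for right-truncated variables [TU]

\[
\hat F_{T,n}(t) = \prod_{1 \le i \le n} \Bigl\{1 - \frac{I_{\{t < T_i \le Y_i\}}}{\sum_{j=1}^n I_{\{T_j \le T_i \le Y_j\}}}\Bigr\},
\]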



and an estimator of $F_\varepsilon$ is deduced from those of $F_{Y|X}$, $F_X$ and
$m$ as

\[
\hat F_{\varepsilon,n}(s) = n^{-1} \sum_{1 \le i \le n} \hat F_{Y|X,n}(s + \hat m_n(X_i); X_i).
\]
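
The two marginal product-limit estimators under left truncation can be sketched in Python. This is only an illustration: the function names and the small data set are hypothetical, and the sample is assumed to contain only pairs with $T_i \le Y_i$.

```python
def survival_y(y, ys, ts):
    """Product-limit estimate of P(Y > y) from left-truncated pairs (Y_i, T_i)."""
    surv = 1.0
    for yi, ti in zip(ys, ts):
        if ti <= yi <= y:
            # Number of individuals j with T_j <= Y_i <= Y_j.
            at_risk = sum(1 for yj, tj in zip(ys, ts) if tj <= yi <= yj)
            surv *= 1.0 - 1.0 / at_risk
    return surv


def distribution_t(t, ys, ts):
    """Product-limit estimate of P(T <= t) for the truncation variable."""
    dist = 1.0
    for yi, ti in zip(ys, ts):
        if t < ti <= yi:
            # Number of individuals j with T_j <= T_i <= Y_j.
            at_risk = sum(1 for yj, tj in zip(ys, ts) if tj <= ti <= yj)
            dist *= 1.0 - 1.0 / at_risk
    return dist


# Hypothetical truncated sample.
ys = [1.0, 2.0, 3.0, 4.0]
ts = [0.5, 0.2, 2.5, 1.5]
surv_after_1 = survival_y(1.0, ys, ts)
dist_at_2 = distribution_t(2.0, ys, ts)
```

Only two individuals are at risk at $Y = 1$ in this sample, so the survival estimate drops to 1/2 there, whereas the naive empirical survival function would give 3/4.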



The means of $T$ and $Y$ are estimated by

\[
\hat\mu_{T,n} = \sum_{i=1}^n \frac{T_i\,\hat F_{T,n}(T_i)}{\sum_{j=1}^n I_{\{T_j \le T_i \le Y_j\}}}, \quad
\hat\mu_{Y,n} = \sum_{i=1}^n \frac{Y_i\,\hat{\bar F}_{Y,n}(Y_i^-)}{\sum_{j=1}^n I_{\{T_j \le Y_i \le Y_j\}}}.
\]

The estimators $\hat F_{Y,n}$ and $\hat F_{T,n}$ are known to be P-uniformly
consistent and asymptotically Gaussian. For the further convergence restricted to
the interval $I_{n,h} = \{(y, x) \in I_{Y,X} : x \in I_{X,n,h}\}$, assume

Condition 4.1 CI. h = hn and nh^ —> oo as n —> oo, J/C = 1, 

$\kappa_1 = \int x^2 K(x)\,dx$ and $\kappa_2 = \int K^2 < \infty$.

C2. The conditional probability $\alpha$ is strictly positive in the interior of $I_X$.
C3. The distribution function $F_{Y|X}$ is twice continuously differentiable with
respect to $x$ and differentiable with respect to $y$.
C4. $E\varepsilon^{2+\delta} < \infty$ for a $\delta$ in $(1/2, 1]$.

Let us denote $\dot F_{Y|X,2}(y, x) = \partial F_{Y|X}(y, x)/\partial x$,
$\ddot F_{Y|X,2}(y, x) = \partial^2 F_{Y|X}(y, x)/\partial x^2$,
and $\dot F_{Y|X,1}(y, x) = \partial F_{Y|X}(y, x)/\partial y$.

Proposition 4.1 $\sup_{I_{n,h}} |\hat A_n - A|$ and $\sup_{I_{n,h}} |\hat B_n - B|$ converge in probability to 0,
bih{y\x) = {EAn-A)iy;x) = ^^Ki-lJ FT{v)FY,x,2{dv,dx) 

-A{y;x) J FT{v)FY,x,2{dv,dx)^ +o{h^), 
b^^iy; x) = {E% - B){y- x) = -^>,,{FTiy) J FyxAdv, dx) dx 

-B{y;x) J FT{v)FY,x,2{dv,dx)} + o{h^), 

\[
v^A_{n,h}(y; x) = \mathrm{var}\,\hat A_n(y; x) = (nh)^{-1} \kappa_2 A(1 - A)(y; x)\,\alpha^{-1}(x) + o((nh)^{-1}),
\]
\[
v^B_{n,h}(y; x) = \mathrm{var}\,\hat B_n(y; x) = (nh)^{-1} \kappa_2 B(1 - B)(y; x)\,\alpha^{-1}(x) + o((nh)^{-1}).
\]

If $nh^5 \to 0$, $(nh)^{1/2}(\hat A_n - A)$ and $(nh)^{1/2}(\hat B_n - B)$ converge
in distribution to Gaussian processes with mean zero, variances
$\kappa_2 A(1 - A)(y; x)\,\alpha^{-1}(x)$ and $\kappa_2 B(1 - B)(y; x)\,\alpha^{-1}(x)$
respectively, and the covariances of the limiting processes are zero.

The proof relies on an expansion of the form

\[
(nh)^{1/2}(\hat A_n - A)(y; x) = (nh)^{1/2} c^{-1}(x) \{(a_n - a)(y; x) - A(c_n - c)(x)\} + o_{L^2}(1)
\]

with $\hat A_n = c_n^{-1} a_n$ and $\hat B_n = c_n^{-1} b_n$, where

\[
c_n(x) = n^{-1} \sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le Y_i\}}, \quad
a_n(y; x) = n^{-1} \sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le Y_i \le y\}},
\]
\[
b_n(y; x) = n^{-1} \sum_{i=1}^n K_h(x - X_i) I_{\{T_i \le y \le Y_i\}}.
\]

A similar approximation holds for $\hat B_n$. The biases and variances are deduced
from those of each term and the weak convergences are established as in [J .
From Proposition 4.1 and applying the results of the nonparametric regression,



Proposition 4.2 The estimators $\hat F_{Y|X,n}$, $\hat m_n$, $\hat F_{\varepsilon,n}$
converge P-uniformly to $F_{Y|X}$, $m$, $F_\varepsilon$; $\hat\mu_{Y,n}$ and
$\hat\mu_{T,n}$ converge P-uniformly to $EY$ and $ET$ respectively.

The weak convergence of the estimated distribution function of truncated survival
data was proved in several papers [HE]. As in [3] and by Proposition 4.1, their
proof extends to their weak convergence on
$(\min_i\{Y_i : T_i < Y_i\}, \max_i\{Y_i : T_i < Y_i\})$ under the conditions
$\int F_T^{-1}\,dF_{Y|X} < \infty$ and $\int \bar F_{Y|X}^{-1}\,dF_T < \infty$ on
$I_{X,n,h}$, which are simply satisfied if for every $x$ in $I_{X,n,h}$,
$\inf\{t : F_T(t) > 0\} < \inf\{t : F_{Y|X}(t; x) > 0\}$ and
$\sup\{t : \bar F_{Y|X}(t; x) > 0\} < \sup\{t : \bar F_T(t) > 0\}$.

Theorem 4.1 $(nh)^{1/2}(\hat F_{Y|X,n} - F_{Y|X}) 1_{I_{n,h}}$ converges weakly to a
centered Gaussian process $W$ on $I_{Y,X}$. The variables
$(nh)^{1/2}(\hat m_n - m)(x)$, for every $x$ in $I_{X,n,h}$, and
$(nh)^{1/2}(\hat\mu_{Y,n} - EY)$ converge weakly to $EW(Y; x)$ and
$E\int W(Y; x)\,dF_X(x)$.

If $m$ is supposed monotone with inverse function $r$, $X$ is written
$X = r(Y - \varepsilon)$ and the quantiles are defined by the inverse functions
$q_1$ and $q_2$ of $F_{Y|X}$ at fixed $y$ and $x$, respectively, through the
equivalence between $F_{Y|X}(y; x) = u$ and

\[
x = r(y - Q_\varepsilon(u)) = q_1(u; y), \quad
y = m(x) + Q_\varepsilon(u) = q_2(u; x),
\]

where $Q_\varepsilon(u)$ is the inverse of $F_\varepsilon$ at $u$. Finally, if $m$
is increasing, $F_{Y|X}(y; x)$ is decreasing in $x$ and increasing in $y$, and it
is the same for its estimator $\hat F_{Y|X,n}$ up to a random set of small
probability. The thresholds $q_1$ and $q_2$ are estimated by

\[
\hat q_{1,n,h}(u; y) = \sup\{x : \hat F_{Y|X,n}(y; x) \le u\},
\]
\[
\hat q_{2,n,h}(u; x) = \inf\{y : \hat F_{Y|X,n}(y; x) \ge u\}.
\]

As a consequence of Theorem 4.1 and generalizing known results on quantiles,



Theorem 4.2 For $k = 1, 2$, $\hat q_{k,n,h}$ converges P-uniformly to $q_k$ on
$\hat F_{Y|X,n}(I_{n,h})$. For every $y$ and (respect.) $x$,
$(nh)^{1/2}(\hat q_{1,n,h} - q_1)(\cdot; y)$ and
$(nh)^{1/2}(\hat q_{2,n,h} - q_2)(\cdot; x)$ converge weakly to the centered
Gaussian process $W \circ q_1 [\dot F_{Y|X,1}(y; q_1(\cdot; y))]^{-1}$ and,
respect., $W \circ q_2 [\dot F_{Y|X,2}(q_2(\cdot; x); x)]^{-1}$.

5. Truncation and censoring of Y in a nonparametric model



The variable $Y$ is supposed left-truncated by $T$ and right-censored by a variable
$C$ independent of $(X, Y, T)$. The notations $\alpha$ and those of the joint and
marginal distribution functions of $X$, $Y$ and $T$ are in section 4 and $F_C$ is
the distribution function of $C$. The observations are
$\delta = 1_{\{Y \le C\}}$ and $(X, Y \wedge C, T)$, conditionally on
$Y \wedge C \ge T$. Let

\[
A(y; x) = P(Y \le y \wedge C \mid X = x, T \le Y)
  = \alpha^{-1}(x) \int_{-\infty}^{y} F_T(v) \bar F_C(v)\,F_{Y|X}(dv; x),
\]
\[
B(y; x) = P(T \le y \le Y \wedge C \mid X = x, T \le Y)
  = \alpha^{-1}(x) F_T(y) \bar F_C(y) \bar F_{Y|X}(y; x),
\]
\[
\bar F_{Y|X}(y; x) = \exp\Bigl\{-\int_{-\infty}^{y} B^{-1}(v; x)\,A(dv; x)\Bigr\}.
\]
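The estimators are now written

\[
\hat{\bar F}_{Y|X,n}(y; x) = \prod_{1 \le i \le n} \Bigl\{1 - \frac{K_h(x - X_i) I_{\{T_i \le Y_i \le y \wedge C_i\}}}{\sum_{j=1}^n K_h(x - X_j) I_{\{T_j \le Y_i \le Y_j \wedge C_j\}}}\Bigr\},
\]
\[
\hat m_n(x) = \sum_{i=1}^n \frac{Y_i I_{\{T_i \le Y_i \le C_i\}} K_h(x - X_i)\,\hat{\bar F}_{Y|X,n}(Y_i^-; x)}{\sum_{j=1}^n K_h(x - X_j) I_{\{T_j \le Y_i \le Y_j \wedge C_j\}}}.
\]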



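If $Y$ is only right-truncated by $C$ independent of $(X, Y)$, with observations
$(X, Y)$ and $C$ conditionally on $Y \le C$, the expressions $\alpha$, $A$ and $B$
are now written

\[
\alpha(x) = P(Y \le C \mid X = x) = \int_{-\infty}^{\infty} \bar F_C(y)\,F_{Y|X}(dy; x),
\]
\[
A(y; x) = P(Y \le y \mid X = x, Y \le C) = \alpha^{-1}(x) \int_{-\infty}^{y} \bar F_C(v)\,F_{Y|X}(dv; x),
\]
\[
B(y; x) = P(Y \le C \le y \mid X = x, Y \le C) = \alpha^{-1}(x) \int_{-\infty}^{y} F_{Y|X}(v; x)\,dF_C(v).
\]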



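The distribution functions $F_C$ and $F_{Y|X}$ are both identifiable and their
expressions differ from the previous ones,

\[
\bar F_C(y) = \exp\Bigl\{-\int_{-\infty}^{y} \{EB(v; X)\}^{-1}\,EA'(dv; X)\Bigr\},
\]
\[
F_{Y|X}(y; x) = \exp\Bigl\{-\int_{y}^{\infty} B^{-1}(v; x)\,A(dv; x)\Bigr\}.
\]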



The estimators are now 



^y|x,«(y;a;) = ^ ^ ~ v^h — ^^7 ^Fvr f' 



I{Y,<Ci<y} 
l<i<n I ^'j=rhYi<Y,<C,} 
FcAy) = n l^-y-^ 

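If $Y$ is left and right-truncated by variables $T$ and $C$, mutually independent
and independent of $(X, Y)$, the observations are $(X, Y)$, $C$ and $T$,
conditionally on $T \le Y \le C$,

\[
\alpha(x) = P(T \le Y \le C \mid X = x) = \int_{-\infty}^{\infty} F_T(y) \bar F_C(y)\,F_{Y|X}(dy; x),
\]
\[
A(y; x) = P(Y \le y \mid X = x, T \le Y \le C)
  = \alpha^{-1}(x) \int_{-\infty}^{y} F_T(v) \bar F_C(v)\,F_{Y|X}(dv; x),
\]
\[
B(y; x) = P(T \le y \le Y \mid X = x, T \le Y \le C)
  = \alpha^{-1}(x) F_T(y) \int_{y}^{\infty} \bar F_C(v)\,F_{Y|X}(dv; x),
\]
\[
A'(y) = P(y \le T \mid T \le Y \le C) = \int_{y}^{\infty} dF_T(t) \int_{t}^{\infty} \bar F_C\,dF_Y,
\]
\[
B'(y) = P(Y \le y \le C \mid T \le Y \le C) = \bar F_C(y) \int_{-\infty}^{y} F_T\,dF_Y,
\]
\[
B''(y) = P(C \le y \mid T \le Y \le C) = \int_{-\infty}^{y} \Bigl\{\int_{-\infty}^{s} F_T(v)\,dF_Y(v)\Bigr\}\,dF_C(s).
\]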

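The distribution functions $F_C$, $F_T$ and $F_{Y|X}$ are identifiable, with
$F_{Y|X}$ defined by
$\bar F_{Y|X}(y; x) = -\int_{y}^{\infty} \bar F_C^{-1}\,dH(\cdot, x)$ and

\[
H(y; x) = \int_{y}^{\infty} \bar F_C(v)\,F_{Y|X}(dv; x)
  = \exp\Bigl\{-\int_{-\infty}^{y} B^{-1}(v; x)\,A(dv; x)\Bigr\},
\]
\[
\bar F_C(s) = \exp\Bigl\{-\int_{-\infty}^{s} B'^{-1}\,dB''\Bigr\},
\]
\[
F_T(t) = \exp\Bigl[-\int_{t}^{\infty} \{EB(\cdot, X)\}^{-1}\,dA'\Bigr].
\]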

Their estimators are 

B / \ TT 1 1 I{T,<Y,<C,<s} 1 

[ l^j = lHT,<Y,<CU<C,} J 



I{Ti<Yi<Ci<t} 



HTi<Ti<Yj<Cj} 



^ , . EtihUYi)im<Y<c.^y}Kt,{x-Xi)HY\xAYi-;x) 

FY\x{y;x) = = — — — — ; 

Z^j=l ^h[X - Xj)l{Tj<Yi<Y,<C,} 
HY\x{y;x) ^ [[S^-^ — T7-f FT7 (■ 

The other nonparametric estimators of the introduction and the results of
section 5 generalize to all the estimators of this section.

Right and left-truncated distribution functions $F_{Y|X}$ and the truncation
distributions are estimated in a closed form by the solutions of a
self-consistency equation [51 [S] . The estimators still have asymptotically
Gaussian limits even with dependent truncation distributions, when the martingale
theory for point processes does not apply.



6. Observation by interval 



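Consider model (2) with an independent censoring variable $C$ for $Y$. For
observations by intervals, only $C$ and the indicators that $Y$ belongs to the
interval $]-\infty, C]$ or $]C, \infty[$ are observed. The function $F_{Y|X}$ is
not directly identifiable and efficient estimators for $m$ and $F_{Y|X}$ are
maximum likelihood estimators. Let $\delta = I_{\{Y \le C\}}$ and assume that
$F_\varepsilon$ is continuously differentiable, with density $f_\varepsilon$.
Conditionally on $C$ and $X = x$, the log-likelihood of $(\delta, C)$ is

\[
l(\delta, C) = \delta \log F_\varepsilon(C - m(x)) + (1 - \delta) \log \bar F_\varepsilon(C - m(x))
\]

and its derivatives with respect to $m(x)$ and $F_\varepsilon$ are

\[
\dot l_{m(x)}(\delta, C) = -\delta\,\frac{f_\varepsilon}{F_\varepsilon}(C - m(x)) + (1 - \delta)\,\frac{f_\varepsilon}{\bar F_\varepsilon}(C - m(x)),
\]
\[
\dot l_{\varepsilon,a}(\delta, C) = \delta\,\frac{\int_{-\infty}^{C - m(x)} a\,dF_\varepsilon}{F_\varepsilon(C - m(x))}
  + (1 - \delta)\,\frac{\int_{C - m(x)}^{\infty} a\,dF_\varepsilon}{\bar F_\varepsilon(C - m(x))}
\]

for every $a$ such that $\int a\,dF_\varepsilon = 0$ and
$\int a^2\,dF_\varepsilon < \infty$. With $a_F = -f'_\varepsilon f_\varepsilon^{-1}$,
$\dot l_{\varepsilon,a_F} = \dot l_{m(x)}$, so $\dot l_{m(x)}$ belongs to the tangent
space for $F_\varepsilon$, and the estimator of $m(x) = E(Y \mid X = x)$ must be
determined from the estimator of $F_\varepsilon$ through the conditional
probability function of the observations

\[
B(t; x) = P(Y \le C \le t \mid X = x) = \int_{-\infty}^{t} F_\varepsilon(s - m(x))\,dF_C(s).
\]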
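
The log-likelihood of an interval observation and its derivative with respect to $m(x)$ can be checked numerically. The sketch below assumes, for illustration only, a standard logistic error distribution (centered, with a smooth density); the function names and numerical values are hypothetical.

```python
import math


def logistic_cdf(z):
    """Distribution function of a centered logistic error (assumed example)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_pdf(z):
    """Density of the same logistic error."""
    f = logistic_cdf(z)
    return f * (1.0 - f)


def log_likelihood(delta, c, m, cdf=logistic_cdf):
    """delta * log F(c - m) + (1 - delta) * log(1 - F(c - m))."""
    f = cdf(c - m)
    return delta * math.log(f) + (1 - delta) * math.log(1.0 - f)


def score_m(delta, c, m, cdf=logistic_cdf, pdf=logistic_pdf):
    """Derivative of the log-likelihood with respect to m."""
    z = c - m
    return -delta * pdf(z) / cdf(z) + (1 - delta) * pdf(z) / (1.0 - cdf(z))


# Compare the analytic score with a central finite difference.
c, m, eps = 1.3, 0.4, 1e-6
checks = []
for delta in (0, 1):
    numeric = (log_likelihood(delta, c, m + eps) - log_likelihood(delta, c, m - eps)) / (2 * eps)
    checks.append(abs(numeric - score_m(delta, c, m)))
```

The sign pattern is the expected one: increasing $m(x)$ lowers the likelihood of an observation with $\delta = 1$ and raises it when $\delta = 0$.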

Let $\hat F_{C,n}$ be the empirical estimator of $F_C$ and

\[
\hat B_n(t; x) = \frac{\sum_{i=1}^n K_h(x - X_i) I_{\{Y_i \le C_i \le t\}}}{\sum_{i=1}^n K_h(x - X_i)},
\]

an estimator $\hat F_{\varepsilon,n}(t - m(x))$ of $F_\varepsilon(t - m(x))$ is
deduced by deconvolution and $\hat m_n(x) = \int t\,d\hat F_{\varepsilon,n}(t - m(x))$.